feat: runner leader election — ctl-api + SDK #410
Draft
RealHarshThakur wants to merge 13 commits into main
This PR was marked as stale, and will be closed after 3 more days. Add the #keep-open label to prevent this from being closed.
Runner Leader Election
Problem
When multiple runners exist in a RunnerGroup, all of them process jobs concurrently. There's no coordination to ensure only one runner handles work at a time, which can cause duplicate processing and unpredictable behavior.
We want to have leader election to solve for:
Solution
In this PR, I've focused on the automatic, hands-off experience of picking up the local dev runner.
The control plane (ctl-api) now elects a single leader runner per group. Only the leader processes jobs — all other runners sit in quiet standby.
How it works
Data model
A single leader_runner_id column on runner_groups (nullable FK to runners.id). NULL means no healthy runner is available.
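To make the column's semantics concrete, here is a minimal sketch using an in-memory SQLite database. Only the names `runner_groups`, `runners`, `leader_runner_id`, `id` and `created_at` come from this PR; the column types, the helper functions `make_schema`, `get_leader` and `set_leader`, and the use of SQLite are assumptions for illustration and are not the actual ctl-api migration.

```python
import sqlite3


def make_schema() -> sqlite3.Connection:
    """Create a toy schema with a nullable leader FK (types are assumed)."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE runners (
            id TEXT PRIMARY KEY,
            created_at TEXT NOT NULL
        );
        CREATE TABLE runner_groups (
            id TEXT PRIMARY KEY,
            -- NULL means no healthy runner is available.
            leader_runner_id TEXT NULL REFERENCES runners(id)
        );
        """
    )
    return conn


def get_leader(conn: sqlite3.Connection, group_id: str):
    """Return the group's leader runner id, or None if there is no leader."""
    row = conn.execute(
        "SELECT leader_runner_id FROM runner_groups WHERE id = ?", (group_id,)
    ).fetchone()
    return row[0] if row else None


def set_leader(conn: sqlite3.Connection, group_id: str, runner_id) -> None:
    """Point the group at a runner; pass None to clear the leader."""
    conn.execute(
        "UPDATE runner_groups SET leader_runner_id = ? WHERE id = ?",
        (runner_id, group_id),
    )
```

In this sketch the foreign key rejects a leader id that does not exist in `runners`, while NULL stays valid and represents a group with no leader.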
Taints
We also introduced the concept of "Taints" for a smooth dev workflow: when working locally, we want the local runner to be picked, and when that runner restarts we don't want its jobs to be picked up by the cloud runner. Tainting a runner ensures that doesn't happen.
Election logic
Picks the oldest active runner by created_at (deterministic, avoids churn). There's an event loop that runs every minute to check whether the leader should be re-elected.
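The selection rule can be sketched roughly as follows. This is an illustration under stated assumptions, not the ctl-api implementation: the `Runner` dataclass, its `active` flag, the `elect_leader` and `reconcile` helpers, and the tie-break on `id` are all hypothetical, and taints are not modeled here.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Runner:
    # Hypothetical shape; the PR only names id and created_at.
    id: str
    created_at: datetime
    active: bool


def elect_leader(runners: Iterable[Runner]) -> Optional[str]:
    """Return the id of the oldest active runner, or None if none is active."""
    candidates = [r for r in runners if r.active]
    if not candidates:
        return None
    # Oldest created_at wins; id is an assumed tie-break to keep the
    # result deterministic when two timestamps are equal.
    oldest = min(candidates, key=lambda r: (r.created_at, r.id))
    return oldest.id


def reconcile(
    current_leader_id: Optional[str], runners: Iterable[Runner]
) -> Tuple[Optional[str], bool]:
    """One pass of the periodic check: returns (leader_id, changed)."""
    new_leader_id = elect_leader(runners)
    return new_leader_id, new_leader_id != current_leader_id
```

A periodic task could call `reconcile` once a minute and write `leader_runner_id` only when `changed` is true. Because the rule depends only on `created_at`, a newly registered runner never displaces a healthy leader, which is what keeps churn low.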
The docs are in a separate PR, but it might help to go through them for context on this PR.
Implements runner leader election across the control API and SDK layers.
Stack: